The psychophysics of home plate umpire calls

We analyze the visual perception task that home plate umpires (N = 121) perform calling balls and strikes (N = 3,001,019) in baseball games, focusing on the topics of perceptual learning and bias in decision-making. In the context of perceptual learning, our results show that monitoring, training, and feedback improve skill over time. In addition, we document two other aspects of umpires’ improvement that are revealing with respect to the nature of their perceptual expertise. First, we show that biases in umpires’ decision-making persist even as their overall accuracy improves. This suggests that bias and accuracy are orthogonal and that reduction of bias in decision-making requires interventions aimed specifically at this goal. Second, we measure a distinct difference in the rate of skill improvement between older and younger umpires. Younger umpires improve more quickly, suggesting that the decision task umpires engage in becomes routinized over time.


Prior Distributions
To complete the Bayesian model, we apply prior distributions to the model parameters. We apply standard normal priors, N(0, 1), to each of the contextual covariates in Equation 1, excluding the global intercept terms and the umpire-specific intercept terms. The global intercept terms receive informative prior distributions as described below. The umpire-specific terms receive a hierarchical prior distribution as follows. For each umpire, we construct a 6-dimensional vector, called ζ_u, that contains the umpire's intercept for each of the six strike zone parameters. These ζ_u vectors are drawn from a hierarchical multivariate normal prior with mean µ_ζ and variance-covariance matrix Σ_ζ, as in Equation 1. We give each element of the hyperparameter µ_ζ a normal prior centered on 0 with variance 1, and we give Σ_ζ an LKJ prior with parameter value 2 [3].
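For reference, the hierarchical prior on the umpire-specific intercepts described above can be written compactly as follows. The index k over the six strike zone parameters is notation introduced here for illustration, and the LKJ prior is written on Σ_ζ exactly as the text states it.

```latex
\begin{align*}
\zeta_u \mid \mu_\zeta, \Sigma_\zeta &\sim \mathcal{N}_6(\mu_\zeta, \Sigma_\zeta), \\
\mu_{\zeta,k} &\sim \mathcal{N}(0, 1), \quad k = 1, \dots, 6, \\
\Sigma_\zeta &\sim \mathrm{LKJ}(2).
\end{align*}
```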
Conveniently, the rulebook provides specific physical measurements of the strike zone that constitute a good initial guess at the true strike zone parameters. Our prior distributions are based on the values set by the rulebook. The rulebook stipulates that the width of the strike zone is 17 inches. Because the α parameter measures one half the width of the strike zone, we expect it to be about 17/24 feet (8.5 inches). Based on this, we give the α intercept parameter a normal prior centered on log(17/24) = −0.34 with variance 1.
The height of the strike zone depends on the batter's stance, but it is generally about two feet. In our model, the height of the strike zone is equal to 2λα, which, with α = 17/24, suggests a λ value of 24/17. We give the λ intercept parameter a normal prior centered on log(24/17) = 0.34 with variance 1.
Based on the rulebook definition, the x_0 parameter should be 0. Like the λ parameter, the y_0 parameter will change depending on the stance of each batter. Generally, the vertical center of the strike zone will be at about 2.5 feet. We give the x_0 intercept parameter a normal prior distribution centered on 0 with variance 1 and give the y_0 intercept parameter a normal prior distribution centered on log 2.5 = 0.92 with variance 1.
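The prior centers quoted above follow from simple arithmetic on the rulebook measurements. The short Python sketch below reproduces them; the variable names are chosen here for illustration and are not taken from the paper's code.

```python
import math

# Rulebook-based quantities quoted in the text (lengths in feet).
half_width_ft = 17 / 24      # half of the 17-inch strike zone width
zone_height_ft = 2.0         # approximate strike zone height
center_height_ft = 2.5       # approximate vertical center of the zone

# Height = 2 * lambda * alpha, so lambda = height / (2 * alpha).
lam = zone_height_ft / (2 * half_width_ft)

# Centers of the normal priors on the global intercepts.
# x0 is not on the log scale; the others are.
prior_centers = {
    "alpha_intercept": math.log(half_width_ft),
    "lambda_intercept": math.log(lam),
    "x0_intercept": 0.0,
    "y0_intercept": math.log(center_height_ft),
}

for name, center in prior_centers.items():
    print(f"{name}: {center:.2f}")
```

Rounded to two decimals, the α and λ centers are equal in magnitude and opposite in sign because 24/17 is the reciprocal of 17/24.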
For the β and r parameters, the rulebook stipulates values of −∞ and ∞, respectively. As these are parameters of special importance in the context of studying perception, we want to avoid influencing the outcome with our choice of prior. For that reason, we choose relatively diffuse priors. For the β intercept parameter we use a normal prior with mean 4 and variance 1. For the r intercept parameter we use a normal prior centered on 1 with variance 1. The priors on β and r are difficult to evaluate in substantive terms.
Because the parameters are unit-free scaling factors, the substantive meaning of each parameter's scale is not straightforward. We think that our choices of priors are reasonable for at least two reasons. First, there are enough data to overwhelm the priors. Second, we are primarily interested in changes over time and we apply the same prior to each year. If we are introducing unreasonable bias through our priors, then we are introducing the same bias in each year.
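The claim that enough data overwhelm the priors can be illustrated with a toy calculation. The sketch below uses a conjugate normal model with known observation variance, which is an assumption made purely for illustration and is much simpler than the model in the paper; the prior N(4, 1) mirrors the β intercept prior, while the sample mean of 6 and the observation variance of 1 are hypothetical.

```python
def posterior_mean_var(prior_mean, prior_var, data_mean, data_var, n):
    """Conjugate update for a normal mean with known observation variance."""
    precision = 1.0 / prior_var + n / data_var
    mean = (prior_mean / prior_var + n * data_mean / data_var) / precision
    return mean, 1.0 / precision

# Hypothetical numbers: a N(4, 1) prior and data whose sample mean is 6.
results = {}
for n in (1, 100, 100_000):
    results[n] = posterior_mean_var(4.0, 1.0, 6.0, 1.0, n)
    print(n, results[n])
```

With one observation the posterior mean sits halfway between the prior mean and the sample mean; as n grows it moves toward the sample mean and the posterior variance shrinks toward zero.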

Definition of Bias
For each strike zone parameter, bias is defined to be the difference in that parameter's value between two combinations of covariates (corresponding to two in-game contexts) that differ in only one of the covariates.
For example, we estimate home-team bias by comparing each strike zone parameter value when the home team is pitching to the parameter value when the away team is pitching, while holding all other covariates constant. Each measure of bias must be defined for a specific strike zone parameter. We can estimate home-team bias in strike zone width, height, consistency, etc.
In the paper, we measure home-team bias for the game context of a right-handed pitcher and a right-handed batter in a 0-0 count. The home-team bias in strike zone width, which we call τ_{Home,α}, is defined as a difference of two terms: the first is the width parameter given to home-team pitchers and the second is that given to away-team pitchers, so the measure is larger when umpires favor home-team pitchers. Technically, α^{0-0} = α^{top} = 0 because a 0-0 count is the reference value for α^{count} and the top of the inning is the reference value for α^{side}. We include those terms in the definition for completeness.
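One plausible way to write the home-team bias in width described above is sketched below. It assumes that the width parameter is the exponential of a sum of α terms, which is suggested by the log-scale priors and by the remark that the exponential function is absent only for x_0. The labels α^{intercept} and α^{bottom} are notation assumed here, and any handedness terms are omitted.

```latex
\[
\tau_{\mathrm{Home},\alpha}
  = \exp\!\left(\alpha^{\mathrm{intercept}} + \alpha^{0\text{-}0} + \alpha^{\mathrm{top}}\right)
  - \exp\!\left(\alpha^{\mathrm{intercept}} + \alpha^{0\text{-}0} + \alpha^{\mathrm{bottom}}\right)
\]
```

Under this form, the first term corresponds to the top of the inning, when the home team is pitching, and the second to the bottom of the inning, when the away team is pitching.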
We calculate count bias for each count relative to a 0-0 count. For example, bias in width in a 3-0 count, called τ_{3-0,α}, is defined as the difference between the width parameter in a 3-0 count and the width parameter in a 0-0 count, holding all other covariates constant. All measures of bias that we present are calculated in this way, other than those for x_0. Bias in x_0 can be measured directly as the difference in parameter estimates for different values of each covariate because the exponential function does not appear in the calculation of x_0. For example, home-team bias in x_0 depends only on the estimate of x_0^{bottom}, the coefficient for the bottom of the inning, because x_0^{top} = 0, as it is the reference value for x_0^{side}.
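The bias calculations described in this section can be sketched in a few lines of Python. Everything numeric below is hypothetical, and the sketch assumes an exponential link for the width parameter and the sign convention "home-team pitching minus away-team pitching" for x_0 as well; neither the coefficient values nor the function name comes from the paper.

```python
import math

# Hypothetical coefficient values on the log scale; none come from the paper.
# Reference categories (0-0 count, top of the inning) are fixed at 0.
alpha = {"intercept": math.log(17 / 24), "0-0": 0.0, "3-0": -0.05,
         "top": 0.0, "bottom": -0.01}
x0 = {"intercept": 0.0, "top": 0.0, "bottom": 0.02}

def width(count, side):
    """Width parameter implied by the alpha terms, assuming an exponential link."""
    return math.exp(alpha["intercept"] + alpha[count] + alpha[side])

# The home team pitches in the top of the inning, so it supplies the first term.
tau_home_alpha = width("0-0", "top") - width("0-0", "bottom")

# Count bias is measured relative to a 0-0 count, other covariates held fixed.
tau_30_alpha = width("3-0", "top") - width("0-0", "top")

# x0 has no exponential, so its bias is a plain difference of coefficients.
tau_home_x0 = x0["top"] - x0["bottom"]

print(tau_home_alpha, tau_30_alpha, tau_home_x0)
```

Because the reference-category coefficients are zero, the x_0 bias reduces to a single coefficient, while the width biases still depend on the intercept through the exponential.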

Robustness Checks
In this section, we report two additional specifications of the cohort model to show that the substantive result does not depend on our decision to restrict the dataset to umpires who debuted in 2008 or earlier. Our first model is identical to the model presented in the main paper, but it includes the entire set of umpires. The results are shown in Figure S3. The primary substantive result, that older umpires improve more slowly, persists. This is apparent in the right-hand panel.
The primary difference between the results of this model and those of the model from the main paper is that umpires who debuted after 2008 have higher starting values of β. That is, when they debut, they have higher consistency than their older counterparts had in their own debuts. This is probably a combination of two factors. First, as we suggested in the main paper, training at the minor league level may depend indirectly on observations derived from the PITCHf/x system. That is, the umpires who debuted after 2008 received better training while in the minor leagues than their older counterparts did. Second, for an umpire who debuted after 2008, the average consistency of an MLB umpire was higher at his debut than it was in 2008. This is a result of the general skill improvements we report in this paper. Presumably, this means that after 2008, minor league umpires had to display a higher level of this particular skill in order to qualify for the major leagues, possibly with higher weight placed on these more objectively measured ball-strike calls relative to previous years.
Our second robustness check model includes the entire set of umpires but adds an additional covariate and an interaction term at the second level of the hierarchical model. The additional covariate is an indicator variable that equals one if the umpire debuted after 2008 and zero if the umpire debuted during or before 2008. Given the point made in the last paragraph, that the threshold for entry into the major leagues may have changed after 2008, it is unreasonable to model the initial consistency measurements of umpires who debuted after 2008 as exchangeable with those of umpires who debuted earlier. Thus, we model them with an interaction term.
The results of this model are shown in Figure S4. In the new specification of the second-level regression, the variable D_u is the indicator for whether umpire u debuted after 2008 (D_u = 1) or during or prior to 2008 (D_u = 0).
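To make the role of the indicator D_u concrete, the sketch below builds one row of a second-level design matrix per umpire. It assumes, purely for illustration, that the second-level predictor is centered birth year and that the interaction is between D_u and birth year; the umpire records, the centering year, and the function name are all hypothetical.

```python
# Hypothetical umpires: (label, birth_year, debut_year); not from the paper.
umpires = [("A", 1955, 1985), ("B", 1970, 2001), ("C", 1982, 2010)]

def design_row(birth_year, debut_year, center=1970):
    """Return [1, birth, D_u, D_u * birth] for one umpire."""
    birth = birth_year - center          # centered birth year (assumed scaling)
    d_u = 1 if debut_year > 2008 else 0  # 1 if debuted after 2008, else 0
    return [1, birth, d_u, d_u * birth]

rows = {label: design_row(birth, debut) for label, birth, debut in umpires}

for label, row in rows.items():
    print(label, row)
```

For umpires with D_u = 0 the last two columns vanish, so the added terms change the intercept and slope only for the group that debuted after 2008.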
Again, the main substantive result persists in this model. The primary difference between the results of this model and those of the preceding model is that the relationship between age and rate of improvement among umpires who debuted after 2008 changes. In the first model it was similar to that for all umpires (this is not surprising, given that the hierarchical assumption shrank the estimates for these umpires towards the linear relationship fit with all umpires). In this model, it appears to be negative, which would indicate that the youngest set of umpires is actually getting less consistent over time. We suspect that this is not actually the case and that this negative estimate is a random result. The uncertainty in the estimates for the umpires who debuted after 2008 is large, which makes sense given that we have fewer years of observation for those umpires.

Figure S1: Illustration of the strike zone and the umpire's judgment task. The strike zone is a three-dimensional space floating above the edges of home plate. According to the MLB rulebook, its height is determined by the batter's body position, such that the bottom of the strike zone is at the hollow below the batter's knee and the top is at the midpoint between the batter's shoulders and the top of the batter's pants. (This image is under a Creative Commons 2.0 license and is available online [1].)

Figure S3: These are the results from the first cohort analysis robustness model. The left plot shows the relationship between birth year and an umpire's initial consistency in his first year of observation under the PITCHf/x system. The right-hand plot shows the relationship between an umpire's birth year and his rate of improvement in consistency over the time during which he is observed by the PITCHf/x system.

Figure S4: These are the results from the second cohort analysis robustness model. The plots are analogous to those in the preceding figure.